1 /*
2 * Copyright (C) 2011 The Guava Authors
3 *
4 * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
5 * in compliance with the License. You may obtain a copy of the License at
6 *
7 * http://www.apache.org/licenses/LICENSE-2.0
8 *
9 * Unless required by applicable law or agreed to in writing, software distributed under the License
10 * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
11 * or implied. See the License for the specific language governing permissions and limitations under
12 * the License.
13 */
14
15 package com.google.common.hash;
16
17 import com.google.common.annotations.Beta;
18 import com.google.common.primitives.Ints;
19
20 import java.nio.charset.Charset;
21
22 /**
23 * A hash function is a collision-averse pure function that maps an arbitrary block of
24 * data to a number called a <i>hash code</i>.
25 *
26 * <h3>Definition</h3>
27 *
28 * <p>Unpacking this definition:
29 *
30 * <ul>
31 * <li><b>block of data:</b> the input for a hash function is always, in concept, an
32 * ordered byte array. This hashing API accepts an arbitrary sequence of byte and
33 * multibyte values (via {@link Hasher}), but this is merely a convenience; these are
34 * always translated into raw byte sequences under the covers.
35 *
36 * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit
37 * length (given by {@link #bits}). For example, {@link Hashing#sha1} produces a
38 * 160-bit number, while {@link Hashing#murmur3_32()} yields only 32 bits. Because a
39 * {@code long} value is clearly insufficient to hold all hash code values, this API
40 * represents a hash code as an instance of {@link HashCode}.
41 *
42 * <li><b>pure function:</b> the value produced must depend only on the input bytes, in
43 * the order they appear. Input data is never modified. {@link HashFunction} instances
44 * should always be stateless, and therefore thread-safe.
45 *
46 * <li><b>collision-averse:</b> while it can't be helped that a hash function will
47 * sometimes produce the same hash code for distinct inputs (a "collision"), every
48 * hash function strives to <i>some</i> degree to make this unlikely. (Without this
49 * condition, a function that always returns zero could be called a hash function. It
50 * is not.)
51 * </ul>
52 *
53 * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield
54 * unequal <i>often</i>." This is the most important characteristic of all hash functions.
55 *
56 * <h3>Desirable properties</h3>
57 *
58 * <p>A high-quality hash function strives for some subset of the following virtues:
59 *
60 * <ul>
61 * <li><b>collision-resistant:</b> while the definition above requires making at least
62 * <i>some</i> token attempt, one measure of the quality of a hash function is <i>how
63 * well</i> it succeeds at this goal. Important note: it may be easy to achieve the
64 * theoretical minimum collision rate when using completely <i>random</i> sample
65 * input. The true test of a hash function is how it performs on representative
66 * real-world data, which tends to contain many hidden patterns and clumps. The goal
67 * of a good hash function is to stamp these patterns out as thoroughly as possible.
68 *
69 * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should
70 * yield only the expected <i>twofold</i> increase to all collision rates. Informally,
71 * the "information" in the hash code should be as evenly "spread out" through the
72 * hash code's bits as possible. The result is that, for example, when choosing a
73 * bucket in a hash table of size 2^8, <i>any</i> eight bits could be consistently
74 * used.
75 *
76 * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are
77 * designed to make it as infeasible as possible to reverse-engineer the input that
78 * produced a given hash code, or even to discover <i>any</i> two distinct inputs that
79 * yield the same result. These are called <i>cryptographic hash functions</i>. But,
80 * whenever it is learned that either of these feats has become computationally
81 * feasible, the function is deemed "broken" and should no longer be used for secure
82 * purposes. (This is the likely eventual fate of <i>all</i> cryptographic hashes.)
83 *
84 * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration.
85 * We have published <a href="#noWeHaventYet">microbenchmark results</a> for many
86 * common hash functions.
87 * </ul>
88 *
89 * <h3>Providing input to a hash function</h3>
90 *
91 * <p>The primary way to provide the data that your hash function should act on is via a
92 * {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher},
93 * "push" the relevant data into it using methods like {@link Hasher#putBytes(byte[])},
94 * and finally ask for the {@code HashCode} when finished using {@link Hasher#hash}. (See
95 * an {@linkplain #newHasher example} of this.)
96 *
97 * <p>If all you want to hash is a single byte array, string or {@code long} value, there
98 * are convenient shortcut methods defined directly on {@link HashFunction} to make this
99 * easier.
100 *
101 * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code
102 * T} provided that you implement a {@link Funnel Funnel<T>} to specify how to "feed" data
103 * from that object into the function. (See {@linkplain Hasher#putObject an example} of
104 * this.)
105 *
106 * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always
107 * interpreted in <i>little-endian</i> order. That is, hashing the byte array {@code
108 * {0x01, 0x02, 0x03, 0x04}} is equivalent to hashing the {@code int} value {@code
109 * 0x04030201}. If this isn't what you need, methods such as {@link Integer#reverseBytes}
110 * and {@link Ints#toByteArray} will help.
111 *
112 * <h3>Relationship to {@link Object#hashCode}</h3>
113 *
114 * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no
115 * separation between hash algorithms and the data they act on, so alternate hash
116 * algorithms can't be easily substituted. Also, implementations of {@code hashCode} tend
117 * to be poor-quality, in part because they end up depending on <i>other</i> existing
118 * poor-quality {@code hashCode} implementations, including those in many JDK classes.
119 *
120 * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak
121 * collision prevention and <i>no</i> expectation of bit dispersion. This leaves them
122 * perfectly suitable for use in hash tables, because extra collisions cause only a slight
123 * performance hit, while poor bit dispersion is easily corrected using a secondary hash
124 * function (which all reasonable hash table implementations in Java use). For the many
125 * uses of hash functions beyond data structures, however, {@code Object.hashCode} almost
126 * always falls short -- hence this library.
127 *
128 * @author Kevin Bourrillion
129 * @since 11.0
130 */
131 @Beta
132 public interface HashFunction {
133 /**
134 * Begins a new hash code computation by returning an initialized, stateful {@code
135 * Hasher} instance that is ready to receive data. Example: <pre> {@code
136 *
137 * HashFunction hf = Hashing.md5();
138 * HashCode hc = hf.newHasher()
139 * .putLong(id)
140 * .putBoolean(isActive)
141 * .hash();}</pre>
142 */
143 Hasher newHasher();
144
145 /**
146 * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the
147 * expected size of the input (in bytes). This is only important for non-streaming hash
148 * functions (hash functions that need to buffer their whole input before processing any
149 * of it).
150 */
151 Hasher newHasher(int expectedInputSize);
152
153 /**
154 * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given
155 * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i>
156 * perform better than its longhand equivalent, but should not perform worse.
157 *
158 * @since 12.0
159 */
160 HashCode hashInt(int input);
161
162 /**
163 * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the
164 * given {@code long} value, interpreted in little-endian byte order. The implementation
165 * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
166 */
167 HashCode hashLong(long input);
168
169 /**
170 * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation
171 * <i>might</i> perform better than its longhand equivalent, but should not perform
172 * worse.
173 */
174 HashCode hashBytes(byte[] input);
175
176 /**
177 * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
178 * <i>might</i> perform better than its longhand equivalent, but should not perform
179 * worse.
180 *
181 * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length}
182 * or {@code len < 0}
183 */
184 HashCode hashBytes(byte[] input, int off, int len);
185
186 /**
187 * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation
188 * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
189 * Note that no character encoding is performed; the low byte and high byte of each {@code char}
190 * are hashed directly (in that order).
191 *
192 * @since 15.0 (since 11.0 as hashString(CharSequence)).
193 */
194 HashCode hashUnencodedChars(CharSequence input);
195
196 /**
197 * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded
198 * using the given {@link Charset}. The implementation <i>might</i> perform better than its
199 * longhand equivalent, but should not perform worse.
200 */
201 HashCode hashString(CharSequence input, Charset charset);
202
203 /**
204 * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation
205 * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
206 *
207 * @since 14.0
208 */
209 <T> HashCode hashObject(T instance, Funnel<? super T> funnel);
210
211 /**
212 * Returns the number of bits (a multiple of 32) that each hash code produced by this
213 * hash function has.
214 */
215 int bits();
216 }